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Abstract This paper deals with the clustering of univariate observations: 
given a set of observations coming from K possible clusters, one has to esti- 
mate the cluster means. We propose an algorithm based on the minimization 
of the "KP" criterion we introduced in a previous work. In this paper, we 
show that the global minimum of this criterion can be reached by first solv- 
ing a linear system then calculating the roots of some polynomial of order K. 
The KP global minimum provides a first raw estimate of the cluster means, 
and a final clustering step enables to recover the cluster means. Our method's 
relevance and superiority to the Expectation-Maximization algorithm is illus- 
trated through simulations of various Gaussian mixtures. 

Keywords unsupervised clustering • non-iterative algorithm • optimization 
criterion • univariate observations 



1 Introduction 

In this paper we focus on the clustering of univariate observations coming 
from K possible clusters, when the number of clusters is known. One method 
consists in estimating the observation pdf (mixture of K pdf), by associating 
a kernel to each observation and adding the contribution of all the kernels 
(Parzcn 1962). A search of the pdf modes then leads to the cluster means. 
The drawback of such method is that it requires the configuration of extra- 
parameters (kernel design, intervals for the mode search). Alternately, the 
Expectation-Maximization (EM) (Dempster et al. 1977) algorithm is the most 
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commonly used method when the mixture densities belong to the same known 
parameterized family. It is an iterative algorithm that look for the mixture pa- 
rameters that maximize the likelihood of the observations. Each EM iteration 
consists of two steps. The Expectation step estimates the probability for each 
observation to come from each mixture component. Then, during the Maxi- 
mization step, these estimated probabilities are used to update the estimation 
of the mixture parameters. One can show that this procedure converges to one 
maximum (local or global) of the likelihood (Dempster et al. 1977). If the mix- 
ture components do not belong to a common and known parameterized family, 
the EM algorithm does not directly apply. Yet, if the component densities do 
not overlap too much, some clustering methods can be used to cluster the data 
and calculate the cluster means: in (Fisher 1958) an algorithm is proposed to 
compute the if-partition of the N sorted observations which minimize the 
sum of the squares within clusters. Instead of testing the (^Zj) possible parti- 
tions, some relationships between /c-partitions and (k + l)-partitions are used 
to recursively compute the optimal if-partition. The main drawbacks of this 
method are a high sensitivity to potential differences between the cluster vari- 
ances and a complexity in 0(KN 2 ) (Fitzgibbon 2000). Among the clustering 
methods, the K-Means algorithm (Hartigan 1977) is one of the most popular 
method. It is an iterative algorithm which groups the data into K clusters in 
order to minimize an objective function such as the sum of point to cluster 
mean square Euclidean distance. The main drawback of K-Means or EM is the 
potential convergence to some local extrema of the criterion they use. Some 
solutions consist for instance in using smart initializations (see (McLachlan 
and Peel 2000) and (Lindsay and Furman 1994) for EM, (Bradley and Fayyad 
1998) for k-means) or stochastic optimization, to become less sensitive in the 
initialization (see (Celeux et al. 1995) and (Pernkopf and Bouchaffra 2005) 
for EM, (Krishna and Murty 1999) for K-means). Another drawback of these 
methods is the convergence speed, which can be very slow when the number 
of observations is high. A survey of the clustering techniques can be found in 
(Bcrkhin 2006) and (Xu and Wunsch 2005). In this contribution, we propose 
a non-iterative algorithm which mainly consists in calculating the minimum 
of the "K-Product" (KP) criterion we first introduced in (Paul et al. 2006): 
if {z n } n £{i...N} is a set of N observations, K the known number of clusters 
and {xk}ke{i - K} any vector of R K , we define the KP criterion as the sum 
of all the K-tcrms products JlfcLi^™ — x k) 2 ■ The main motivation for us- 
ing such criterion is that, though it provides a slightly biased estimation of 
the cluster means, its global minimum can be reached by first solving a lin- 
ear system then calculating the roots of some polynomial of order K. Once 
these K roots have been obtained, a final clustering step assigns each obser- 
vation to the closest root and calculates the resulting cluster means. Another 
advantage of the proposed method is that it does not require the configura- 
tion of any extra-parameters. The rest of the paper is organized as follow: In 
section 2 the observation model is presented and the criterion is defined. In 
section 3 the criterion global minimum is theoretically calculated. In section 
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4 the clusters estimation algorithm is described. Section 5 presents simulation 
results which illustrate the algorithm performances on different Gaussian mix- 
tures: mixtures of three, six and nine components have been simulated with 
various configurations (common/different mixing weights, common/different 
variances). Conclusions arc finally given in Section 6. 

2 Observation model and criterion definition 

Let {ak}ke{i - K} be a set of K different values of M , let a be the vector of 

cluster means defined by a = (ax, a,2 ■ ■ ■ clk) 1 , let {TTk}ke{i - K} be a set of K 
mixing weights (prior probabilities) that sum up to one and let {gk}k£{i - K} 
be a set of K zeros-mean densities. The probability density function of the 
multimodal observation z is a finite mixture given by: 

K 

f(z) = ^2ir k 9k{z - flfe) 

Note that the form of the densities g k are usually not known by the estimator 
and that the g k do not necessarily belong to the same parameterized family. 
Now let {^n}ne{i-jv} be a set of N observations in Mr. In all the following we 
assume that N is greater than K and that the number of different observations 
is greater than K — 1. The KP criterion J(x) is defined by: 

N K 

J : R K - M+ : x - £ J] (z n - x k f (1) 

n=l k=l 

Note the difference with the K-means criterion which can be written (for the 
square Euclidean distance): 

JV 

K-means : R K — > R + : x — > min (z n ~x k ) 2 

The KP criterion (|TJ) is clearly positive for any vector x. The first intuitive 
motivation for defining this criterion is its asymptotic behavior when all the g k 
variances are null. In this case, all the observations are equal to one of the a k 
and therefore J (a) = 0. J(x) is then minimal when x is equal to a or any of its 
K\ permutations. The second motivation is that, in the general case, J have 
K\ minima that are the K\ permutations of one single vector which can be 
reached by solving a linear system then finding the roots of some polynomial 
of order K . This is shown in section 3. 

3 KP global minimum 

We first give in section 13.11 some useful definitions which are needed in section 
13.21 to reach the global minimum of J. 



4 



3.1 Some useful definitions 

To any observation z n we associate the vector z„ defined by: 

z„^-\^- 2 -..,1)*, z„eM* (2) 
The vector z and the Hankel matrix Z are then respectively defined by: 



JV 

A 

z 

n=l 
N 



= 5>*z n , zeR K (3) 



z = E^L ZeK M (4) 

n=l 

The matrix Z is regular if the number of different observations is greater than 
K — 1 (one explanation is detailed in Appendix A). 

Now let y = (t/i, • • • , yjtf be a vector of R K . We define the polynomial of 
order K q y (a) as: 

K 

q y (a)^a K -J2* K ~ k yk (5) 
fe=l 

if r = (n, • • • , rjf)' is a vector of C K containing the X roots of q y (a) the 
factorial form of q y (a) is: 

if 

Qy( a ) = Y[( a r k) 
k=i 

q y (a) = a K - (n + • • • + )a x " 1 + ... + (-l) K (ri x r 2 • • • x r K ) 

K 

q y {a) = a K - ^ /"'w^r) 
fc=i 

where u>fc(r) is the Elementary Symmetric Polynomial (ESP) in the variables 
ri, • • • , t\r- defined by: 

^(r) ^ (-l)^ 1 ^ ^.r, 2 ....r, fc (6) 

! <: •••• ./• :• 

/ jl k 

For instance, for K = 3, we have: 

tui(r) = n + r 2 + r 3 
w 2 (r) = -(rir 2 + r 2 r 3 + nr 3 ) 
103 (r) = rir 2 r 3 
If we call w(r) the vector of ESP of r defined by: 

w(r) = (ttfi(r),--- ,iOi{(r))' (7) 
the relationship between the roots and coefficients of q y (a) becomes: 

y = w(r)^Vke{l---K} q y (r k )=0 (8) 
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Table 1 KP algorithm steps and complexities 



step 1: calculate a minimum of J 



calculate Z and z: 0(NK) 
calculate y m ^ n by solving Z.y min = z: 0(K 2 ) 
calculate the roots (ii im ,:„, ■ • • ,XK,min) of qy miri (a): 0(K 2 ) 
step 2: clustering and cluster means estimation 

assign each z n to the closest m i„: O(NK) 
calculate the K means of the resulting clusters: O(N) 



3.2 The KP minimum 

The global minimum of J is given by theorem 1: 

Theorem 1 if y min is the solution of Z.y min — z (where z and Z have been 
defined in ([3]) and (|4]) ) and if x m i n is a vector containing, in any order, the 
K roots of q Vmin (a) (defined in )■ then x min belongs to R K and x min is the 
global minimum of J. 

The proof is given in appendix B. 
4 Clusters estimation algorithm 

The clusters estimation algorithm consists of two steps. In the first step, the 
minimum of J, )*, is calculated, giving a first raw 

estimation of the set of cluster means. This first estimate is slighly biased: for 
instance, for a Gaussian mixture with two balanced components centred on 
—a and a and a common standard deviation a the asymptotical solution of 
Z-y m i„ = z is y min = (0, a 2 + <7 2 )* and the roots of q ymin (a) are: 



Therefore, in a second step, each observation z n is assigned to the nearest 
Xk,min, K clusters are formed, and the final estimated cluster means are cal- 
culated. The algorithm steps and their complexities are illustrated in table [T] 
The total complexity is in 0(NK + K 2 ), which is equivalent to O(NK) since 
N is greater than K. 

5 Simulations 

Several types of Gaussian mixture have been considered. The number of com- 
ponents (clusters) is equal to three (scenario A), six (scenario B) and nine 
(scenario C). In scenario A, the distance between two successive cluster cen- 
ters (component means) is equal to one. In scenario B and in scenario C, the 
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Table 2 simulation scenario A 





scenario A.l 


scenario A. 2 


scenario A. 3 


scenario A. 4 


means 


variances prior 


variances prior 


variances prior 


variances prior 









a 2 0.4 


a 2 0.4 


1 


a 2 | 




a 2 0.4 


IT °- 4 


2 


1 


k 


o- 2 0.2 


cr 2 0.2 



Table 3 simulation scenario B 





scenario 


B.l 


scenario 


B.2 


scenario 


B.3 


scenario 


B.4 


means 


variances 


prior 


variances 


prior 


variances 


prior 


variances 


prior 





a 2 


1 

6 


a 2 


1 

6 


a 2 


0.2 


a 2 


0.2 


1 


a 2 


1 
6 


<T 2 

2 


1 

6 


a 2 


0.2 


a 1 
2 


0.2 


2 


a 2 


1 

6 


a 2 


1 

6 


a 2 


0.1 


a 2 


0.1 


4 


a 2 


1 
6 


2 


1 

6 


a 2 


0.2 


2 


0.2 


5 


a 2 


1 

6 


a 2 


1 

6 


a 2 


0.2 


a 2 


0.2 


6 


a 2 


1 

6 


<T 2 

2 


1 

6 


a 2 


0.1 


a 1 
2 


0.1 



distance between two successive centers is equal to one or two. For each sce- 
nario "X" , four cases have been studied: common variance and common mixing 
weight (scenario X.l), different variances and common mixing weight (scenario 
X.2), common variance and different mixing weights (scenario X.3) and dif- 
ferent variances and different mixing weights (scenario X.4). A summary of 
all the scenari is given in Tables 2, 3 and 4. The number of observations (N) 
per simulation run is equal to 100 in scenario A, 200 in scenario B and 300 in 
scenario C. 

To evaluate the performances of our proposal we compare it to the classical 
EM algorithm. To estimate the parameters of a Gaussian mixtures, the EM 
algorithm proceeds as follows (Dempster et al. 1977): if (3^^ is the estimed 

probability that z n comes from cluster k and if n^ te \ d^* e ' and are 
respectively the estimated prior, mean and standard deviation of cluster k at 
iteration ite, then the estimations at iteration ite + 1 are given by: 
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Table 4 simulation scenario C 





scenario 


C.l 


scenario 


C.2 


scenario 


C.3 


scenario 


C.4 


means 


variances 


prior 


variances 


prior 


variances 


prior 


variances 


prior 





a 2 


i 

9 


a 2 


i 
9 


a 2 


2 

15 


a 2 


2 

15 


1 


a 


1 

9 


— 


1 

9 




IB" 


— 


2 
15 


2 


a 2 


1 

9 


a 2 


1 
9 


a 2 


1 

15 


a 2 


1 

15 


4 


a 2 


1 

9 


a 2 


1 

9 


a 2 


1 

15 


a 2 


1 

15 


5 


a 2 


1 
9 


a' 2 

2 


1 
9 


a 2 


3 
15 


a 1 
2 


3 
15 


6 


a 2 


1 

9 


a 2 


1 

9 


a 2 


1 

15 


a 2 


1 

15 


8 


a 2 


1 

9 


a 2 


1 

9 


a 2 


2 

15 


a 2 


2 

15 


9 


a 2 


1 

9 


e 2 

2 


1 

9 


a 2 


2 
15 


a 2 

2 


2 

15 


10 


a 2 


1 

9 


a 2 


1 

9 


a 2 


1 

15 


a 2 


1 

15 



Expectation step: 



, (ite) / 

1 l 



A(*te+1) 
Pn,k 



Maximization step: 




- (ite+l) 
°k 



K 



fe=i ^ 2 ™k 



n=l Pn,k 
n k - — 

Mite+1) 
,(ite+l) _ Z^n=l ' J nM 
Jfe ~~ v-^JV o(ite+l) 

2—m=\ Pn.k 

Mte+1) ( _.(ite+l)V 
Z^n=l l J n,k \fn "fc ^ 

y^JV A(ite+1) 



This iterative procedure converges to one maximum of the likelihood function 

AT 

n P(;J n | {afc(Tfc7r fc } fcg | 1 ... K j). To initialize the EM algorithm in our Simula- 
nt 

tions, if cluster means a k °^ are randomly chosen with a uniform draw in the 
observation zone [min(z„) max(z n )]. For each n, is set to one if is 

the closest cluster means to the observation z n and p£' k is set to zero oth- 
erwise. This initialization is repeated until each cluster contains at least one 
observation. Then the EM starts with a maximization step. The algorithm is 
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stopped if all the estimated parameters do not change between two EM steps 
or if a maximal number of 100 iterations is reached. 

The clustering performances are evaluated as follows: to get rid of the 
permutation ambiguity, for each simulation run r and estimation a r , the per- 
formance criterion e r is defined as the maximal absolute distance between the 
true and estimated sorted vector of cluster means: 

e r = N (sort (a) — sort(a r )) 

where N(x) = max \xiA- 
ke{i-K}' 

The distribution of e r is given in figure 1 for the scenario A.l with a = 0.25 
and 10000 simulation run. The KP minimum is a biased estimation: e r is 
greater than 0.1 for 90% of the run. Yet, e r remains less than 0.2 for 80% of the 
run. Then the "full KP" algorithm (calculation of the KP minimum followed 
by a clustering) always provides an accurate set of estimates: e r remains less 
than 0.1 (resp 0.2) for 80% (resp. 100%) of the run. With the EM algorithm, e r 
is less than 0.1 (resp. 0.2) for 45% (resp. 65%) of the run but e r is greater than 
0.5 for 30% of the run. In this case, the EM gets stuck at a local maximum of 
the likelihood. Typically one estimated cluster mean is located in the middle 
of two true cluster means (assuming a too high variance), while two other 
estimated cluster means are closed to the same true cluster mean. In figure 
2, 3 and 4 we present the EM and KP performances for all the scenari with 
different values of a. In scenario A (scenari A.l to A. 4) the KP algorithm 
estimation is perfect when o is less than 0.2 (e r is less than 0.1 for 95% of 
the run) and remains correct for a < 0.3 (e r is less than 0.2 for 95% of the 
run). On the contrary, EM can provide a wrong set of estimated clusters as 
soon as a is not null: for instance, when a — 0.1, e r is greater than 0.5 for 
25% of the run. When the mixture components strongly overlap (er > 0.5) 
the two methods lead to wrong estimations, with a slight superiority of EM 
when a > 0.8. The KP algorithm remains superior to EM in scenario B and 
in scenario C. In scenario B (6 clusters), KP is robust for any a less than 0.15, 
while, for a = 0.1, EM converges to a wrong set of cluster means for 75% of 
the run. in scenario C (9 clusters), KP is robust for any o less than 0.05, while, 
for a = 0.02, EM converges to a wrong set of cluster means for 70% of the run. 
In each scenario, the mixing weights configuration (balanced/unbalanced) has 
a slight influence on the KP algorithm: the performances on sub-scenari X.3 
and X.4 (different mixing weights) are weaker than the performances on sub- 
scenari X.l and X.2 (common mixing weights). Yet the KP performances on 
the unbalanced mixtures remain strongly greater than the EM performances. 

6 Conclusion 

We have proposed a clusters estimation algorithm for univariate observations 
when the number of clusters is known. It is based on the minimization of the 
"KP" criterion we first introduced in (Paul et al. 2006). We have shown that 
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performances on scenario A.1 with sigma=0.25 




0.2 0.4 0.6 0.8 1 1.2 



Fig. 1 Estimation performances on scenario A.l with a = 0.25. 10000 simulation run have 
been performed. For each simulation run, 100 observations have been generated. e r is the 
maximal distance between the sorted vector of true cluster means and the sorted vector of 
estimated cluster means 

the global minimum of this criterion can be reached with a linear least square 
minimization followed by a roots finding algorithm. This minimum is used 
to get a first raw estimation of the cluster means, and a final clustering step 
enables to recover the cluster means. The proposed method is not iterative, its 
complexity is in 0(NK + K 2 ) and it does not require the configuration of any 
extra parameter. Simulations have illustrated the KP algorithm performances 
and superiority to the Expectation-Maximization algorithm which can get 
stuck at a local maximum of the likelihood. We focused on the univariate case 
and our current researchs deal with the multivariate case. If the observations 
z n belong to K d , if {xk}ke{i - K} is any set of K vectors of E d , the KP criterion 

is now defined as the sum of all the K-terms products Ylk=i ll z « ~~ x kllR<»- The 
minima of such criterion and some algorithms to reach them arc currently 
being studied. 
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manuscript. 
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A : non-singularity of Z 

In appendix A we explain why the matrix Z of size K X K, defined in is regular if the 
number of different observations is greater than K — 1. Z can be written as the following 
matrix product: 

Z = VV* 

where V is a K X N Vandermonde Matrix defined by: 

V = (zi, Z2, ■ ■ ■ zjv) 

and z n has been defined in J2}. Let us assume that the K first observations are different. The 
determinant of the KxK Vandermonde matrix (zi , Z2 , • ■ ■ zjy ) is equal to V{ {zj — Zi), 

l<i<j<K 

which is different from zero. The rank of V is then equal to K, so the rank of Z is equal to 
K and Z is regular. 



B : proof of theorem 1 

In appendix B we prove theorem 1. Let F be the function defined by: 

JV K 

F:C K ->R+ :x^ ]T ]J \\z n - x k \\l 
n=l k=l 

The restriction of F to M. K is the function J since the observations z n are real: 

Vx e R K : F(x) = J(x) (9) 



11 



Now let H be the function defined by: 

H:C K ^R+ :y^ ^|U*-z^y| 



N 



The function H applied to the ESP of a vector x in C k is equal to the function F applied 
to x: 



Vx G C K : F(x) = ^2 
?i=l 

developping l|10| l using definition J6} leads to: 

JV 

Vx 6 C A ' : F(x) = J2 

n=l 

including definitions Q and 171 : 

JV 

Vx£C K : F(x) = ^ 



(10) 



<w(x) 



Vx e 



F(x) = H(w(x)) 



The global minimum of H is the linear least square solution y min given by: 

JV 



Ymin = argmm 



El 

n=l 



_.A 



(11) 



(12) 



developping (1 1 2 I t using definitions (O and J3J and remembering that the coefficients of Z 
and z are real: 

y ml n = argmin { y H Zy - 2Re{y H }z| 



yec^ 



(13) 



The Hankel matrix Z is regular since the number of different observations is greater than 
K — 1 (appendix A). System (1 1 3 I t therefore has exactly one solution. Since Z belongs to 
R KxK and z belongs to R K , y mi „ belongs to R K . Now let x mi „=(a;i, mi „ , ■ ■ • ,2^, mm)' 
be a vector containing, in any order, the K (potentially complex) roots of q y . (a). One 
can show that the following holds: 

(i) x m i„ is a global minimum of F 

(ii) x min e R K 

(iii) x mm is a global minimum of J 
Property (i) is a direct consequence of lilt : 

Vx e C K : F(x) = H(w(x)) 

Vx 6 C K : F(x) > min {H} 
VxeC^: F(x) > H(y min ) 
According to J5J, y m ^„ = w(x min ) and we have: 

Vx G C K : F(x) > fl-(w(x roj „)) 

Vx 6 C K : F(x) > F(x mi „) 
which proves (i). Property (ii) can be shown by contradiction: if x m i„ does not belong to 



R K , then for one of the x^^min we have , 
z n are real: 

Vne {!,-■■ ,JV} 



7^ Re{a;( c m i n } and, since all the observations 
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which leads to: 

This is impossible since x m i„ is a global minimum of F. This proves property (ii). We finally 
have to prove (iii): since x m i„ G M. K we have, using l|9|l : 



(14) 



Furthermore, according to l9l l: 



Vx e R" 



K 



J(x) = F(x) 



Vx e r 



J(x) > min{F} 



then, according to property (i): 



Vx g R 



A' 



J(x) > F(x min ) 
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EM and KP performances on scenario A 



KP - scenario A.1 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 

Sigma 




0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1 

Sigma 



Fig. 2 performances of the EM and KP algorithms on scenario A for different values of <r. 
For each value of a and for each sub-scenario 10000 simulation run have been performed. 
For each simulation run, 100 observations have been generated. e r is the maximal distance 
between the sorted vector of true cluster means and the sorted vector of estimated cluster 
means. The performance criteria are the probabilities for e r to be smaller than 0.1 (top), 
smaller than 0.2 (middle) and greater than 0.5 (bottom). 
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EM and KP performances on scenario B 



— e — KP - scenario B.1 




0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 

sigma 




0.05 0.1 0.15 0.2 0.25 0.3 0.35 0.4 

sigma 



0.9 




sigma 



Fig. 3 performances of the EM and KP algorithms on scenario B for different values of a. 
For each value of a and for each sub-scenario 10000 simulation run have been performed. 
For each simulation run, 200 observations have been generated. e r is the maximal distance 
between the sorted vector of true cluster means and the sorted vector of estimated cluster 
means. The performance criteria are the probabilities for e r to be smaller than 0.1 (top), 
smaller than 0.2 (middle) and greater than 0.5 (bottom). 
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Fig. 4 performances of the EM and KP algorithms on scenario C for different values of 
<t. For each value of <r and for each sub-scenario 1000 simulation run have been performed. 
For each simulation run, 300 observations have been generated. e r is the maximal distance 
between the sorted vector of true cluster means and the sorted vector of estimated cluster 
means. The performance criteria are the probabilities for e r to be smaller than 0.1 (top), 
smaller than 0.2 (middle) and greater than 0.5 (bottom). 



